{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WfGpInivS0fG"
   },
   "source": [
    "<h2 align=\"center\">点击下列图标在线运行HanLP</h2>\n",
    "<div align=\"center\">\n",
    "\t<a href=\"https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/pos_stl.ipynb\" target=\"_blank\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\t<a href=\"https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Fpos_stl.ipynb\" target=\"_blank\"><img src=\"https://mybinder.org/badge_logo.svg\" alt=\"Open In Binder\"/></a>\n",
    "</div>\n",
    "\n",
    "## 安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYwV-UkNNzFp"
   },
   "source": [
    "无论是Windows、Linux还是macOS，HanLP的安装只需一句话搞定："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1Uf_u7ddMhUt"
   },
   "outputs": [],
   "source": [
    "!pip install hanlp -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pp-1KqEOOJ4t"
   },
   "source": [
    "## 加载模型\n",
    "HanLP的工作流程是先加载模型，模型的标示符存储在`hanlp.pretrained`这个包中，按照NLP任务归类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4M7ka0K5OMWU",
    "outputId": "d74f0749-0587-454a-d7c9-7418d45ce534"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'CTB5_POS_RNN': 'https://file.hankcs.com/hanlp/pos/ctb5_pos_rnn_20200113_235925.zip',\n",
       " 'CTB5_POS_RNN_FASTTEXT_ZH': 'https://file.hankcs.com/hanlp/pos/ctb5_pos_rnn_fasttext_20191230_202639.zip',\n",
       " 'CTB9_POS_ALBERT_BASE': 'https://file.hankcs.com/hanlp/pos/ctb9_albert_base_20211228_163935.zip',\n",
       " 'CTB9_POS_ELECTRA_SMALL_TF': 'https://file.hankcs.com/hanlp/pos/pos_ctb_electra_small_20211227_121341.zip',\n",
       " 'CTB9_POS_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/pos/pos_ctb_electra_small_20220215_111944.zip',\n",
       " 'CTB9_POS_RADICAL_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/pos/pos_ctb_radical_electra_small_20220215_111932.zip',\n",
       " 'C863_POS_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/pos/pos_863_electra_small_20220217_101958.zip',\n",
       " 'PKU_POS_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/pos/pos_pku_electra_small_20220217_142436.zip',\n",
       " 'PKU98_POS_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/pos/pos_pku_electra_small_20210808_125158.zip',\n",
       " 'PTB_POS_RNN_FASTTEXT_EN': 'https://file.hankcs.com/hanlp/pos/ptb_pos_rnn_fasttext_20200103_145337.zip'}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hanlp\n",
    "hanlp.pretrained.pos.ALL # 语种见名称最后一个字段或相应语料库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BMW528wGNulM"
   },
   "source": [
    "调用`hanlp.load`进行加载，模型会自动下载到本地缓存："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0tmKBu7sNAXX",
    "outputId": "df2de87b-27f5-4c72-8eb2-25ceefdd8270"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading https://file.hankcs.com/hanlp/pos/ctb9_pos_electra_small_20220118_164341.zip to /root/.hanlp/pos/ctb9_pos_electra_small_20220118_164341.zip\n",
      "100%  43.6 MiB  21.2 MiB/s ETA:  0 s [=========================================]\n",
      "Decompressing /root/.hanlp/pos/ctb9_pos_electra_small_20220118_164341.zip to /root/.hanlp/pos\n",
      "Downloading https://file.hankcs.com/hanlp/transformers/electra_zh_small_20210706_125427.zip to /root/.hanlp/transformers/electra_zh_small_20210706_125427.zip\n",
      "100%  41.2 KiB  41.2 KiB/s ETA:  0 s [=========================================]\n",
      "Decompressing /root/.hanlp/transformers/electra_zh_small_20210706_125427.zip to /root/.hanlp/transformers\n"
     ]
    }
   ],
   "source": [
    "pos = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "elA_UyssOut_"
   },
   "source": [
    "## 词性标注\n",
    "词性标注任务的输入为已分词的一个或多个句子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "BqEmDMGGOtk3",
    "outputId": "936d439a-e1ff-4308-d2aa-775955558594"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PN', 'DEG', 'NN', 'VC', 'VV', 'NR', 'DEG', 'NN', 'LB', 'NR', 'VV', 'PU']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pos([\"我\", \"的\", \"希望\", \"是\", \"希望\", \"张晚霞\", \"的\", \"背影\", \"被\", \"晚霞\", \"映红\", \"。\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jj1Jk-2sPHYx"
   },
   "source": [
    "注意上面两个“希望”的词性各不相同，一个是名词另一个是动词。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "suUL042zPpLj"
   },
   "source": [
    "## 自定义词典\n",
    "自定义词典为词性标注任务的成员变量，以CTB标准为例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "AzYShIssP6kq",
    "outputId": "99b2607b-b618-4876-bbea-9f8c24859a85"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(pos.dict_tags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1q4MUpgVQNlu"
   },
   "source": [
    "自定义单个词性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2zZkH9tRQOoi",
    "outputId": "4f92a907-10c3-4798-e7b9-914b8f577b2c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['state-of-the-art-tool',\n",
       " 'P',\n",
       " 'NN',\n",
       " 'NN',\n",
       " 'VV',\n",
       " 'JJ',\n",
       " 'NN',\n",
       " 'AD',\n",
       " 'VA',\n",
       " 'DEC',\n",
       " 'NN',\n",
       " 'NN',\n",
       " 'NN',\n",
       " 'PU']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pos.dict_tags = {'HanLP': 'state-of-the-art-tool'}\n",
    "pos([\"HanLP\", \"为\", \"生产\", \"环境\", \"带来\", \"次\", \"世代\", \"最\", \"先进\", \"的\", \"多语种\", \"NLP\", \"技术\", \"。\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F-9gAeIVQUFG"
   },
   "source": [
    "根据上下文自定义词性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "F8M8cyBrQduw",
    "outputId": "24fa7ff0-305d-4d71-925e-f369b1c50e96"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PN', '补语成分', '名词', 'VC', '动词', 'NR', 'DEG', 'NN', 'LB', 'NR', 'VV', 'PU']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pos.dict_tags = {('的', '希望'): ('补语成分', '名词'), '希望': '动词'}\n",
    "pos([\"我\", \"的\", \"希望\", \"是\", \"希望\", \"张晚霞\", \"的\", \"背影\", \"被\", \"晚霞\", \"映红\", \"。\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要算法基础才能理解，初学者可参考[《自然语言处理入门》](http://nlp.hankcs.com/book.php)。"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "pos_stl.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
